Introduction¶

In this notebook, we will do a comprehensive analysis of the Android app market by comparing thousands of apps in the Google Play store.

About the Dataset of Google Play Store Apps & Reviews¶

Data Source:
App and review data was scraped from the Google Play Store by Lavanya Gupta in 2018. Original files listed here.

Import Statements¶

In [1]:
import pandas as pd
import numpy as np
from IPython.display import display, HTML  # so we can use .display() instead of print(), and render HTML as well

Notebook Presentation¶

In [2]:
# Show numeric output in decimal format e.g., 2.15
pd.options.display.float_format = '{:,.2f}'.format # only works for float though, no thousand separator for int

# Code if we want to format integers with a thousand separator
num_format = lambda x: '{:,}'.format(x)
def build_formatters(df, format):
    return {
        column:format 
        for column, dtype in df.dtypes.items()
        if dtype in [ np.dtype('int64'), np.dtype('int32') ] 
    }

Read the Dataset¶

In [3]:
df_apps = pd.read_csv('apps.csv')

Data Cleaning¶

Challenge: How many rows and columns does df_apps have? What are the column names? Look at a random sample of 5 different rows with .sample().

In [4]:
df_apps.shape
Out[4]:
(10841, 12)
In [5]:
df_apps.columns
Out[5]:
Index(['App', 'Category', 'Rating', 'Reviews', 'Size_MBs', 'Installs', 'Type',
       'Price', 'Content_Rating', 'Genres', 'Last_Updated', 'Android_Ver'],
      dtype='object')
In [6]:
df_apps.sample(5) # to show a random sample, instead of .head() always showing the top 5
Out[6]:
App Category Rating Reviews Size_MBs Installs Type Price Content_Rating Genres Last_Updated Android_Ver
9787 FOX ENTERTAINMENT 3.70 197774 44.00 10,000,000 Free 0 Mature 17+ Entertainment June 25, 2018 4.4 and up
1757 Tricky Bike Stunt Rider DX FAMILY 4.40 16 45.00 500 Free 0 Everyone Simulation;Education July 29, 2018 4.0 and up
2223 BF Beautiful Nature FAMILY 3.60 5 2.50 1,000 Free 0 Mature 17+ Entertainment June 11, 2015 3.0 and up
155 ear super hearing MEDICAL NaN 0 1.40 5 Free 0 Everyone Medical July 31, 2018 4.0 and up
7324 Identity V GAME 4.20 109263 38.00 1,000,000 Free 0 Teen Action July 10, 2018 4.1 and up

Drop Unused Columns¶

Challenge: Remove the columns called Last_Updated and Android_Version from the DataFrame. We will not use these columns.

In [7]:
df_apps.drop(columns=['Last_Updated', 'Android_Ver'], inplace=True)
df_apps.sample(5)
Out[7]:
App Category Rating Reviews Size_MBs Installs Type Price Content_Rating Genres
6697 Network Signal Info TOOLS 4.10 33926 17.00 1,000,000 Free 0 Everyone Tools
7342 Learn JavaScript EDUCATION 4.60 25183 5.40 1,000,000 Free 0 Everyone Education
10666 Garena Free Fire GAME 4.50 5534114 53.00 100,000,000 Free 0 Teen Action
9877 Rolly Vortex GAME 4.40 98123 40.00 10,000,000 Free 0 Teen Arcade
9664 Operate Now: Hospital FAMILY 4.30 254861 52.00 10,000,000 Free 0 Teen Simulation

Find and Remove NaN values in Ratings¶

Challenge: How may rows have a NaN value (not-a-number) in the Ratings column? Create DataFrame called df_apps_clean that does not include these rows.

In [8]:
df_apps.Rating.isna().sum()
Out[8]:
1474

1474 rows have NaN as rating value.

Visualize the rows with NaN values¶

In [9]:
df_apps.loc[df_apps.Rating.isna()].sample(5)
Out[9]:
App Category Rating Reviews Size_MBs Installs Type Price Content_Rating Genres
483 FP Live COMMUNICATION NaN 0 3.30 10 Free 0 Teen Communication
219 DS-Admin FAMILY NaN 1 21.00 10 Free 0 Everyone Education
2078 AX-PIC Pedidos TOOLS NaN 1 3.40 1,000 Free 0 Everyone Tools
2267 Fahrschule DW FAMILY NaN 5 3.90 1,000 Free 0 Everyone Education
1201 AO-EVENT BUSINESS NaN 0 42.00 100 Free 0 Everyone Business

We see that NaN values in ratings are associated with low reviews and lows installs. That makes sense.

In [10]:
df_apps_clean = df_apps.dropna()
df_apps_clean.shape
Out[10]:
(9367, 10)

9367 + 1474 = 10841 rows initially -> OK

Find and Remove Duplicates¶

Challenge: Are there any duplicates in data? Check for duplicates using the .duplicated() function. How many entries can you find for the "Instagram" app? Use .drop_duplicates() to remove any duplicates from df_apps_clean.

In [11]:
df_apps_clean.duplicated().tail()
Out[11]:
10836    False
10837    False
10838    False
10839     True
10840    False
dtype: bool
In [12]:
print(f"{df_apps_clean.duplicated().sum()} entries are pure duplicates.")
476 entries are pure duplicates.
In [13]:
df_apps_clean.loc[df_apps_clean.duplicated()].sort_values(by='App')
Out[13]:
App Category Rating Reviews Size_MBs Installs Type Price Content_Rating Genres
6371 10 Best Foods for You HEALTH_AND_FITNESS 4.00 2490 3.80 500,000 Free 0 Everyone 10+ Health & Fitness
7159 1800 Contacts - Lens Store MEDICAL 4.70 23160 26.00 1,000,000 Free 0 Everyone Medical
2604 2017 EMRA Antibiotic Guide MEDICAL 4.40 12 3.80 1,000 Paid $16.99 Everyone Medical
5140 21-Day Meditation Experience HEALTH_AND_FITNESS 4.40 11506 15.00 100,000 Free 0 Everyone Health & Fitness
946 420 BZ Budeze Delivery MEDICAL 5.00 2 11.00 100 Free 0 Mature 17+ Medical
... ... ... ... ... ... ... ... ... ... ...
9635 theScore: Live Sports Scores, News, Stats & Vi... SPORTS 4.40 133833 34.00 10,000,000 Free 0 Everyone 10+ Sports
9634 theScore: Live Sports Scores, News, Stats & Vi... SPORTS 4.40 133833 34.00 10,000,000 Free 0 Everyone 10+ Sports
9632 theScore: Live Sports Scores, News, Stats & Vi... SPORTS 4.40 133825 34.00 10,000,000 Free 0 Everyone 10+ Sports
10156 trivago: Hotels & Travel TRAVEL_AND_LOCAL 4.20 219848 12.00 50,000,000 Free 0 Everyone Travel & Local
10159 trivago: Hotels & Travel TRAVEL_AND_LOCAL 4.20 219848 12.00 50,000,000 Free 0 Everyone Travel & Local

476 rows × 10 columns

-> Careful there, even if a row is seen only once when displaying the duplicates, it is actually at least twice in the original df.

By default, each column must be identical for .duplicated() to consider two rows duplicates. But we can consider only some columns. E.g. same App name, same Type and same Price

In [14]:
print(f"In this case, {df_apps_clean.duplicated(subset=['App', 'Type', 'Price']).sum()} entries are duplicates.")
In this case, 1168 entries are duplicates.
In [15]:
df_apps_clean = df_apps_clean.drop_duplicates(subset=['App', 'Type', 'Price']) # same subset used for drop_duplicates()
df_apps_clean.duplicated().any()
Out[15]:
False
In [16]:
df_apps_clean.shape
Out[16]:
(8199, 10)

Duplicates removed, now left with 8199 rows.

What else should I know about the data?¶

So we can see that 13 different features were originally scraped from the Google Play Store.

  • Obviously, the data is just a sample out of all the Android apps. It doesn't include all Android apps of which there are millions.

  • I’ll assume that the sample is representative of the App Store as a whole. This is not necessarily the case as, during the web scraping process, this sample was served up based on geographical location and user behaviour of the person who scraped it - in our case Lavanya Gupta.

  • The data was compiled around 2017/2018. The pricing data reflect the price in USD Dollars at the time of scraping. (developers can offer promotions and change their app’s pricing).

  • I’ve converted the app’s size to a floating-point number in MBs. If data was missing, it has been replaced by the average size for that category.

  • The installs are not the exact number of installs. If an app has 245,239 installs then Google will simply report an order of magnitude like 100,000+. I’ve removed the '+' and we’ll assume the exact number of installs in that column for simplicity.

Here’s what you would see under an Android app listing if you go to a listing on the Google Play Store:

Find Highest Rated Apps¶

Challenge: Identify which apps are the highest rated. What problem might you encounter if you rely exclusively on ratings alone to determine the quality of an app?

In [17]:
df_apps_clean.Rating.max()
Out[17]:
5.0
In [18]:
df_apps_clean.loc[df_apps_clean.Rating == 5].App.count()
Out[18]:
271

271 Apps have the highest rating of 5

In [19]:
df_apps_clean.loc[df_apps_clean.Rating == 5].sort_values('Reviews').head()
Out[19]:
App Category Rating Reviews Size_MBs Installs Type Price Content_Rating Genres
1441 DH-UFO FAMILY 5.00 1 59.00 100 Free 0 Everyone Entertainment
310 DT CLOTHINGS SHOPPING 5.00 1 7.90 10 Free 0 Everyone Shopping
312 30WPM Amateur ham radio Koch CW Morse code tra... FAMILY 5.00 1 3.70 10 Paid $1.49 Everyone Education
1272 CG Prints PHOTOGRAPHY 5.00 1 2.30 100 Free 0 Everyone Photography
321 FK CLASSIC FOR YOU BUSINESS 5.00 1 3.50 10 Free 0 Everyone Business

A lot of them have a 5.0 rating but have only one review!

Find 5 Largest Apps in terms of Size (MBs)¶

Challenge: What's the size in megabytes (MB) of the largest Android apps in the Google Play Store. Based on the data, do you think there could be limit in place or can developers make apps as large as they please?

In [20]:
df_apps_clean.Size_MBs.max()
Out[20]:
100.0
In [21]:
df_apps_clean.loc[df_apps_clean.Size_MBs == 100].App.count()
Out[21]:
14

14 apps have the same largest size value (100 MB), this must be the limit in place.

Find the 5 App with Most Reviews¶

Challenge: Which apps have the highest number of reviews? Are there any paid apps among the top 50?

In [22]:
df_apps_clean.sort_values('Reviews', ascending=False)[:5]
Out[22]:
App Category Rating Reviews Size_MBs Installs Type Price Content_Rating Genres
10805 Facebook SOCIAL 4.10 78158306 5.30 1,000,000,000 Free 0 Teen Social
10785 WhatsApp Messenger COMMUNICATION 4.40 69119316 3.50 1,000,000,000 Free 0 Everyone Communication
10806 Instagram SOCIAL 4.50 66577313 5.30 1,000,000,000 Free 0 Teen Social
10784 Messenger – Text and Video Chat for Free COMMUNICATION 4.00 56642847 3.50 1,000,000,000 Free 0 Everyone Communication
10650 Clash of Clans GAME 4.60 44891723 98.00 100,000,000 Free 0 Everyone 10+ Strategy
In [23]:
top_50_reviews = df_apps_clean.sort_values('Reviews', ascending=False)[:50]
top_50_reviews.loc[top_50_reviews.Type == 'Paid']
Out[23]:
App Category Rating Reviews Size_MBs Installs Type Price Content_Rating Genres
In [24]:
print(f"Are there any paid apps among the top 50? {top_50_reviews.loc[top_50_reviews.Type == 'Paid'].App.any()}")
Are there any paid apps among the top 50? False

Plotly Pie and Donut Charts - Visualise Categorical Data: Content Ratings¶

First, we’ll count the number of occurrences of each Content Rating:

In [25]:
ratings = df_apps_clean.Content_Rating.value_counts().to_frame()
ratings
Out[25]:
Content_Rating
Everyone 6621
Teen 912
Mature 17+ 357
Everyone 10+ 305
Adults only 18+ 3
Unrated 1

The first step in creating charts with plotly is to import plotly.express. This is the fastest way to create a beautiful graphic with a minimal amount of code in plotly.

In [26]:
import plotly.express as px

Renaming the columns:¶

In [27]:
ratings.rename(columns={'Content_Rating': 'App_Count'}, inplace=True)
ratings.index.name = 'Content_Rating'
ratings
Out[27]:
App_Count
Content_Rating
Everyone 6621
Teen 912
Mature 17+ 357
Everyone 10+ 305
Adults only 18+ 3
Unrated 1
In [28]:
# fig = px.pie(ratings, values='App_Count', names='Content_Rating') # doesn't work -> "not the name of a column"
fig = px.pie(ratings, values='App_Count', names=ratings.index)
fig.show()

For further tuning, we call fig.update_traces to set other parameters of the chart (you can also use fig.update_layout for changing the layout).

Examples:

fig.update_traces(textposition='inside', textinfo='percent+label')
fig.update_traces(textposition='outside', textinfo='percent+label')

See the Plotly pie documentation and search for "update_traces" to see all possible **kwargs for fig.update_traces

In [29]:
fig = px.pie(ratings, values='App_Count', names=ratings.index, title='App by Content Rating', height=500)  # adding title and height
fig.update_traces(textposition='outside', textinfo='percent+label')
fig.show()

Donut chart:¶

In [30]:
fig = px.pie(ratings, values='App_Count', names=ratings.index, title='App by Content Rating', height=500,
            hole=0.6)
fig.update_traces(textposition='inside', textinfo='percent', textfont_size=15) # changing the style
fig.show()

Numeric Type Conversion: Examine the Number of Installs¶

Challenge: How many apps had over 1 billion (that's right - BILLION) installations? How many apps just had a single install?

Check the datatype of the Installs column.

Count the number of apps at each level of installations.

Convert the number of installations (the Installs column) to a numeric data type. Hint: this is a 2-step process. You'll have make sure you remove non-numeric characters first.

In [31]:
df_apps_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8199 entries, 21 to 10835
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             8199 non-null   object 
 1   Category        8199 non-null   object 
 2   Rating          8199 non-null   float64
 3   Reviews         8199 non-null   int64  
 4   Size_MBs        8199 non-null   float64
 5   Installs        8199 non-null   object 
 6   Type            8199 non-null   object 
 7   Price           8199 non-null   object 
 8   Content_Rating  8199 non-null   object 
 9   Genres          8199 non-null   object 
dtypes: float64(2), int64(1), object(7)
memory usage: 704.6+ KB

I could also use the function .describe() to get information on a Series (e.g. df_apps_clean.Installs.describe())

In [32]:
df_apps_clean.Installs
Out[32]:
21                   1
28                   1
47                   1
82                   5
99                   5
             ...      
10824    1,000,000,000
10828    1,000,000,000
10829    1,000,000,000
10831    1,000,000,000
10835    1,000,000,000
Name: Installs, Length: 8199, dtype: object
In [33]:
type(df_apps_clean.Installs[28])
Out[33]:
str

The "Installs" columns contains strings.

In [34]:
count_installs = df_apps_clean.Installs.value_counts().to_frame(name='App_Count')
count_installs.index.name = 'Nb_of_Installs'
count_installs.sample(3)
Out[34]:
App_Count
Nb_of_Installs
10,000 988
1 3
500,000 504
In [35]:
# No need for a for loop, just using .str allows us to use string methods on a Series
count_installs.index = count_installs.index.str.replace(',', '')
count_installs.sample(3)
Out[35]:
App_Count
Nb_of_Installs
100000 1096
10000000 933
5000000 607
In [36]:
count_installs.index = pd.to_numeric(count_installs.index)
count_installs.sort_values('Nb_of_Installs', inplace=True)
count_installs.head()
Out[36]:
App_Count
Nb_of_Installs
1 3
5 9
10 69
50 56
100 303

Alternatively, instead of using df_apps_clean.Installs.value_counts() to get the number of apps at each level of installations, we could also have used .groupby().count():

In [37]:
count_installs_v2 = df_apps_clean[['App', 'Installs']].groupby('Installs').count()
count_installs_v2.head()
Out[37]:
App
Installs
1 3
1,000 698
1,000,000 1417
1,000,000,000 20
10 69

Also, another method exists to convert a Series from one type to another: .astype()

In [38]:
type(count_installs.index[5])
Out[38]:
numpy.int64
In [39]:
type(count_installs_v2.index[5])
Out[39]:
str
In [40]:
type(count_installs.index.astype(str)[5])
Out[40]:
str
In [41]:
type(count_installs_v2.index.astype(str).str.replace(",", "").astype(int)[5])
Out[41]:
numpy.int64

So essentially, we can use Series.astype(str) to convert it to Strings. And we can also use Series.astype(int) instead of pd.to_numeric(Series)

Display format for integer¶

In [42]:
df_apps_clean.Installs = df_apps_clean.Installs.str.replace(",", "").astype(int)
In [43]:
type(df_apps_clean.Installs[82])
Out[43]:
numpy.int32
In [44]:
df_apps_clean.sample(2)
Out[44]:
App Category Rating Reviews Size_MBs Installs Type Price Content_Rating Genres
3608 BJJ Roadmap by Stephan Kesting SPORTS 4.60 969 3.60 10000 Free 0 Everyone Sports
3195 kick the buddy GAME 3.30 294 19.00 10000 Free 0 Everyone Arcade

When displaying the df, we don't have any thousand separators for the integer values (Reviews and Installs). So we can use this code:

df_apps_clean.Installs = df_apps_clean.Installs.map('{:,}'.format)

It works well, and displays as it should. But then, we're dealing with string values again. It's possible to create a variable dedicated for display only though.

Otherwise we can also use this method (formatters initialised in Notebook-Presentation)

formatters = build_formatters(df_apps_clean, num_format)   # We build the formatter with the df
hmtl_code = df_apps_clean.to_html(formatters=formatters)   # We create the HTML code to display the table the way we want
HTML(html_code)                                            # We use the HTML function to render the code

This last method will display the WHOLE table though... Note: HTML code can used to display the table in a website for instance.

Find the Most Expensive Apps, Filter out the Junk, and Calculate a (ballpark) Sales Revenue Estimate¶

Let's examine the Price column more closely.

Challenge: Convert the price column to numeric data. Then investigate the top 20 most expensive apps in the dataset.

Remove all apps that cost more than $250 from the df_apps_clean DataFrame.

Add a column called 'Revenue_Estimate' to the DataFrame. This column should hold the price of the app times the number of installs. What are the top 10 highest grossing paid apps according to this estimate? Out of the top 10 highest grossing paid apps, how many are games?

The most expensive apps sub $250¶

First, we convert the Price into floats

In [45]:
df_apps_clean.Price = pd.to_numeric(df_apps_clean.Price.str.replace("$", "", regex=False))

-> See pandas .str.replace() Documentation about the regex argument (if not set to False or True, I get a warning)

Then investigate the top 20 most expensive apps in the dataset.

In [46]:
df_apps_clean.sort_values('Price', ascending=False)[:20]
Out[46]:
App Category Rating Reviews Size_MBs Installs Type Price Content_Rating Genres
3946 I'm Rich - Trump Edition LIFESTYLE 3.60 275 7.30 10000 Paid 400.00 Everyone Lifestyle
2461 I AM RICH PRO PLUS FINANCE 4.00 36 41.00 1000 Paid 399.99 Everyone Finance
4606 I Am Rich Premium FINANCE 4.10 1867 4.70 50000 Paid 399.99 Everyone Finance
3145 I am rich(premium) FINANCE 3.50 472 0.94 5000 Paid 399.99 Everyone Finance
3554 💎 I'm rich LIFESTYLE 3.80 718 26.00 10000 Paid 399.99 Everyone Lifestyle
5765 I am rich LIFESTYLE 3.80 3547 1.80 100000 Paid 399.99 Everyone Lifestyle
1946 I am rich (Most expensive app) FINANCE 4.10 129 2.70 1000 Paid 399.99 Teen Finance
2775 I Am Rich Pro FAMILY 4.40 201 2.70 5000 Paid 399.99 Everyone Entertainment
3221 I am Rich Plus FAMILY 4.00 856 8.70 10000 Paid 399.99 Everyone Entertainment
3114 I am Rich FINANCE 4.30 180 3.80 5000 Paid 399.99 Everyone Finance
1331 most expensive app (H) FAMILY 4.30 6 1.50 100 Paid 399.99 Everyone Entertainment
2394 I am Rich! FINANCE 3.80 93 22.00 1000 Paid 399.99 Everyone Finance
3897 I Am Rich FAMILY 3.60 217 4.90 10000 Paid 389.99 Everyone Entertainment
2193 I am extremely Rich LIFESTYLE 2.90 41 2.90 1000 Paid 379.99 Everyone Lifestyle
3856 I am rich VIP LIFESTYLE 3.80 411 2.60 10000 Paid 299.99 Everyone Lifestyle
2281 Vargo Anesthesia Mega App MEDICAL 4.60 92 32.00 1000 Paid 79.99 Everyone Medical
1407 LTC AS Legal MEDICAL 4.00 6 1.30 100 Paid 39.99 Everyone Medical
2629 I am Rich Person LIFESTYLE 4.20 134 1.80 1000 Paid 37.99 Everyone Lifestyle
2481 A Manual of Acupuncture MEDICAL 3.50 214 68.00 1000 Paid 33.99 Everyone Medical
4264 Golfshot Plus: Golf GPS SPORTS 4.10 3387 25.00 50000 Paid 29.99 Everyone Sports

Many ridiculous "I am rich" apps at the top. We remove all apps that cost more than $250.

In [47]:
df_apps_clean = df_apps_clean.loc[df_apps_clean.Price < 250]
df_apps_clean.sort_values('Price', ascending=False)[:5]
Out[47]:
App Category Rating Reviews Size_MBs Installs Type Price Content_Rating Genres
2281 Vargo Anesthesia Mega App MEDICAL 4.60 92 32.00 1000 Paid 79.99 Everyone Medical
1407 LTC AS Legal MEDICAL 4.00 6 1.30 100 Paid 39.99 Everyone Medical
2629 I am Rich Person LIFESTYLE 4.20 134 1.80 1000 Paid 37.99 Everyone Lifestyle
2481 A Manual of Acupuncture MEDICAL 3.50 214 68.00 1000 Paid 33.99 Everyone Medical
2463 PTA Content Master MEDICAL 4.20 64 41.00 1000 Paid 29.99 Everyone Medical

When we look at the top 5 most expensive apps now, we see that 4 out of 5 are medical apps.

Highest Grossing Paid Apps (ballpark estimate)¶

In [48]:
df_apps_clean['Revenue_Estimate'] = df_apps_clean.Price * df_apps_clean.Installs

We could also use the .mul() function to multiply two Series together: df_apps_clean['Revenue_Estimate'] = df_apps_clean.Installs.mul(df_apps_clean.Price)

In [49]:
df_apps_clean.sort_values('Revenue_Estimate', ascending=False)[:10]
Out[49]:
App Category Rating Reviews Size_MBs Installs Type Price Content_Rating Genres Revenue_Estimate
9220 Minecraft FAMILY 4.50 2376564 19.00 10000000 Paid 6.99 Everyone 10+ Arcade;Action & Adventure 69,900,000.00
8825 Hitman Sniper GAME 4.60 408292 29.00 10000000 Paid 0.99 Mature 17+ Action 9,900,000.00
7151 Grand Theft Auto: San Andreas GAME 4.40 348962 26.00 1000000 Paid 6.99 Mature 17+ Action 6,990,000.00
7477 Facetune - For Free PHOTOGRAPHY 4.40 49553 48.00 1000000 Paid 5.99 Everyone Photography 5,990,000.00
7977 Sleep as Android Unlock LIFESTYLE 4.50 23966 0.85 1000000 Paid 5.99 Everyone Lifestyle 5,990,000.00
6594 DraStic DS Emulator GAME 4.60 87766 12.00 1000000 Paid 4.99 Everyone Action 4,990,000.00
6082 Weather Live WEATHER 4.50 76593 4.75 500000 Paid 5.99 Everyone Weather 2,995,000.00
7954 Bloons TD 5 FAMILY 4.60 190086 94.00 1000000 Paid 2.99 Everyone Strategy 2,990,000.00
7633 Five Nights at Freddy's GAME 4.60 100805 50.00 1000000 Paid 2.99 Teen Action 2,990,000.00
6746 Card Wars - Adventure Time FAMILY 4.30 129603 23.00 1000000 Paid 2.99 Everyone 10+ Card;Action & Adventure 2,990,000.00

This is the top 10 highest-grossing paid apps. We see that 7 out the top 10 highest-grossing apps are games.

Plotly Bar Charts & Scatter Plots: Analysing App Categories¶

We can find the number of different categories like so:

In [50]:
df_apps_clean.Category.nunique()
Out[50]:
33
In [51]:
top10_category = df_apps_clean.Category.value_counts()[:10].to_frame()
top10_category.rename(columns={'Category': 'Count'}, inplace=True)
top10_category.index.name = 'Category'
top10_category
Out[51]:
Count
Category
FAMILY 1606
GAME 910
TOOLS 719
PRODUCTIVITY 301
PERSONALIZATION 298
LIFESTYLE 297
FINANCE 296
MEDICAL 292
PHOTOGRAPHY 263
BUSINESS 262

Vertical Bar Chart - Highest Competition (Number of Apps)¶

In [52]:
bar = px.bar(top10_category, x=top10_category.index, y='Count', height=500)
bar.show()

Based on the number of apps, the Family and Game categories are the most competitive. Releasing yet another app into these categories will make it hard to get noticed.Based on the number of apps, the Family and Game categories are the most competitive. Releasing yet another app into these categories will make it hard to get noticed.

But what if we look at it from a different perspective? What matters is not just the total number of apps in the category but how often apps are downloaded in that category. This will give us an idea of how popular a category is. First, we have to group all our apps by category and sum the number of installations:

In [53]:
category_installs = df_apps_clean[['Category', 'Installs']].groupby('Category').sum()
category_installs.sort_values('Installs', inplace=True)
category_installs[:3]
Out[53]:
Installs
Category
EVENTS 15949410
BEAUTY 26916200
PARENTING 31116110

We could also have used the .agg() function:

In [54]:
category_installs = df_apps_clean.groupby('Category').agg({'Installs': 'sum'})
category_installs.sort_values('Installs', ascending=True, inplace=True)
category_installs[:2]
Out[54]:
Installs
Category
EVENTS 15949410
BEAUTY 26916200

Horizontal Bar Chart - Most Popular Categories (Highest Downloads)¶

We can now create a horizontal bar chart, simply by adding the orientation parameter:

In [55]:
# Careful, x and y are reversed here
h_bar = px.bar(category_installs, x='Installs', y=category_installs.index, orientation='h', height=500,
              title='Category Popularity') # custom title
h_bar.update_layout(xaxis_title='Number of Downloads', yaxis_title='Category')  # custom axis labels
h_bar.show()

Now we see that Games and Tools are actually the most popular categories. If we plot the popularity of a category next to the number of apps in that category we can get an idea of how concentrated a category is. Do few apps have most of the downloads or are the downloads spread out over many apps?

Category Concentration - Downloads vs. Competition¶

Challenge:

  • First, create a DataFrame that has the number of apps in one column and the number of installs in another:

  • Then use the plotly express examples from the documentation alongside the .scatter() API referenceto create scatter plot that looks like this.

Hint: Use the size, hover_name and color parameters in .scatter(). To scale the yaxis, call .update_layout() and specify that the yaxis should be on a log-scale like so: yaxis=dict(type='log')

In [56]:
cat_number = df_apps_clean.groupby('Category').agg({'App': 'count', 'Installs': 'sum'})
cat_number.sort_values('Installs', ascending=False, inplace=True)
cat_number[:6]
Out[56]:
App Installs
Category
GAME 910 13858762717
COMMUNICATION 257 11039241530
TOOLS 719 8099724500
PRODUCTIVITY 301 5788070180
SOCIAL 203 5487841475
PHOTOGRAPHY 263 4649143130
In [57]:
scatter = px.scatter(cat_number, x='App', y='Installs', 
                     size='App', color='Installs', height=600, title='Category Concentration', 
                     log_y=True, 
                     hover_name=cat_number.index)  # so that the Category name appears when we hover over a dot.
scatter.update_layout(xaxis_title='Number of Apps (Lower=More Concentrated)', yaxis_title='Installs')
# scatter.update_layout(yaxis=dict(type='log'))    Also possible to use this for the log axis
scatter.show()

What we see is that the categories like Family, Tools, and Game have many different apps sharing a high number of downloads. But for the categories like video players and entertainment, all the downloads are concentrated in very few apps.

Extracting Nested Data from a Column¶

Challenge: How many different types of genres are there? Can an app belong to more than one genre? Check what happens when you use .value_counts() on a column with nested values? See if you can work around this problem by using the .split() function and the DataFrame's .stack() method.

In [58]:
df_apps_clean.Genres.nunique()
Out[58]:
114

If we look at the number of unique values in the Genres column we get 114. But this is not accurate if we have nested data like we do here. We can see this using .value_counts() and looking at the values that just have a single entry. There we see that the semi-colon (;) separates the genre names.

In [59]:
df_apps_clean.Genres.value_counts().to_frame()
Out[59]:
Genres
Tools 718
Entertainment 467
Education 429
Productivity 301
Personalization 298
... ...
Adventure;Brain Games 1
Travel & Local;Action & Adventure 1
Art & Design;Pretend Play 1
Music & Audio;Music & Video 1
Lifestyle;Pretend Play 1

114 rows × 1 columns

We somehow need to separate the genre names to get a clear picture. This is where the string’s .split() method comes in handy, with expand=True

In [60]:
stack_init = df_apps_clean.Genres.str.split(";", regex=False, expand=True)
stack_init.rename(columns={0: 'Genre 1', 1: 'Genre 2'}, inplace=True)
stack_init
Out[60]:
Genre 1 Genre 2
21 Medical None
28 Arcade None
47 Arcade None
82 Arcade None
99 Medical None
... ... ...
10824 Productivity None
10828 Video Players & Editors None
10829 Video Players & Editors None
10831 News & Magazines None
10835 Arcade None

8184 rows × 2 columns

From one Series, we create a DataFrame which splitted the Genre in two, now we have Genre 1 and Genre 2.

After we’ve separated our genre names based on the semi-colon, we can add them all into a single column with .stack() and then use .value_counts().

.stack() -> Stack the prescribed level(s) from columns to index.

In [61]:
stack = stack_init.stack().to_frame()
stack.rename(columns={0: 'Genre_Name'}, inplace=True)
stack.sample(5)
Out[61]:
Genre_Name
3792 Genre 1 Photography
8926 Genre 1 Sports
1790 Genre 1 Events
663 Genre 2 Education
7720 Genre 1 Productivity

Now we've got all Genres, 1 and 2, listed under one same column, that we can name "Genre_Name".

With this data, we can now really count the occurence of each Genre.

In [62]:
num_genres = stack.value_counts().to_frame().rename(columns={0: "Nb_of_Apps"})
num_genres.sample(5)
Out[62]:
Nb_of_Apps
Genre_Name
Comics 54
Trivia 28
Social 203
Role Playing 111
Lifestyle 298
In [63]:
print(f"We have {len(num_genres.Nb_of_Apps)} existing Genres.")
We have 53 existing Genres.
In [64]:
print(f'The "num_genres" dataframe index has {num_genres.index.nlevels} levels.')
The "num_genres" dataframe index has 1 levels.

Only 1 level index but I get an error in Plotly:

Argument 'x' is a pandas MultiIndex. pandas MultiIndex is not supported by plotly express.

So, we reset the index to avoid any issue.

In [65]:
num_genres = num_genres.reset_index()
num_genres.sample(3)
Out[65]:
Genre_Name Nb_of_Apps
41 Comics 54
12 Communication 258
30 Food & Drink 94

Colour Scales in Plotly Charts - Competition in Genres¶

Challenge: Can you create this chart with the Series containing the genre data?

Try experimenting with the built in colour scales in Plotly. You can find a full list here.

  • Find a way to set the colour scale using the color_continuous_scale parameter.
  • Find a way to make the color axis disappear by using coloraxis_showscale.
In [66]:
bar = px.bar(num_genres[:15], x='Genre_Name', y='Nb_of_Apps', height=500, color='Nb_of_Apps', title='Top Genres', 
             color_continuous_scale='Agsunset')  # using a specific color scale
bar.update_layout(xaxis_title='Genre', yaxis_title='Number of Apps', 
                 coloraxis_showscale=False)  # hiding the legend/color axis on the right of the figure
bar.show()

Grouped Bar Charts: Free vs. Paid Apps per Category¶

Now that we’ve looked at the total number of apps per category and the total number of apps per genre, let’s see what the split is between free and paid apps.

In [67]:
df_apps_clean.Type.value_counts().to_frame()
Out[67]:
Type
Free 7595
Paid 589

We see that the majority of apps are free on the Google Play Store. But perhaps some categories have more paid apps than others. Let’s investigate. We can group our data first by Category and then by Type. Then we can add up the number of apps per each type.

In [68]:
df_free_vs_paid = df_apps_clean[['Category', 'Type']].groupby('Category').value_counts().to_frame()
df_free_vs_paid.rename(columns={0: 'Nb_of_Apps'}, inplace=True)
df_free_vs_paid
Out[68]:
Nb_of_Apps
Category Type
ART_AND_DESIGN Free 58
Paid 3
AUTO_AND_VEHICLES Free 72
Paid 1
BEAUTY Free 42
... ... ...
TRAVEL_AND_LOCAL Paid 8
VIDEO_PLAYERS Free 144
Paid 4
WEATHER Free 65
Paid 7

61 rows × 1 columns

The function .groupby() can actually take more than one column as argument:

df_apps_clean[['Category', 'Type']].groupby(["Category", "Type"]).value_counts().to_frame()

or using the agg() is similar but saves us some code here:

In [69]:
df_apps_clean.groupby(["Category", "Type"]).agg({'App': 'count'})[:4]
Out[69]:
App
Category Type
ART_AND_DESIGN Free 58
Paid 3
AUTO_AND_VEHICLES Free 72
Paid 1

Finally, by using as_index=False in groupby(), we push all the data into columns rather than end up with our Categories as the index.

In [70]:
df_free_vs_paid = df_apps_clean.groupby(["Category", "Type"], as_index=False).agg({'App': pd.Series.count})
df_free_vs_paid[:4]
Out[70]:
Category Type App
0 ART_AND_DESIGN Free 58
1 ART_AND_DESIGN Paid 3
2 AUTO_AND_VEHICLES Free 72
3 AUTO_AND_VEHICLES Paid 1

Using as_index=False we push all the data into columns rather than end up with our Categories as the index.

Challenge: Use the plotly express bar chart examples and the .bar() API reference to create this bar chart:

You'll want to use the df_free_vs_paid DataFrame that you created above that has the total number of free and paid apps per category.

See if you can figure out how to get the look above by changing the categoryorder to 'total descending' as outlined in the documentation here here.

In [71]:
g_bar = px.bar(df_free_vs_paid, x='Category', y='App', title='Free vs Paid Apps by Category', height=500, 
              color='Type', log_y=True)  # color for each type, and log y
g_bar.update_layout(xaxis_title='Category', yaxis_title='Number of Apps', 
                    barmode='group')  # bars grouped instead of stacked
g_bar.update_xaxes(categoryorder='total descending')  # arranging the bars by order
g_bar.show()

What we see is that while there are very few paid apps on the Google Play Store, some categories have relatively more paid apps than others, including Personalization, Medical and Weather. So, depending on the category you are targeting, it might make sense to release a paid-for app.

But this leads to many more questions:

  • How much should you charge? What are other apps charging in that category?

  • How much revenue could you make?

  • And how many downloads are you potentially giving up because your app is paid?

Let’s try and answer these questions with some Box plots. Box plots show us some handy descriptive statistics in a graph - things like the median value, the maximum value, the minimum value, and some quartiles.

Plotly Box Plots: Lost Downloads for Paid Apps¶

Challenge: Create a box plot that shows the number of Installs for free versus paid apps. How does the median number of installations compare? Is the difference large or small?

Use the Box Plots Guide and the .box API reference to create the following chart.

In [72]:
box = px.box(df_apps_clean, x='Type', y='Installs', color='Type', log_y=True, height=500, 
             points='all', notched=True,   # points -> all, and notched=True
             title='How Many Downloads are Paid Apps Giving Up?')
box.show()

From the hover text in the chart, we see that the median number of downloads for free apps is 500,000, while the median number of downloads for paid apps is around 5,000! This is massively lower.

But does this mean we should give up on selling a paid app? Let’s see how much revenue we would estimate per category.

Plotly Box Plots: Revenue by App Category¶

Challenge: See if you can generate this chart:

Looking at the hover text, how much does the median app earn in the Tools category? If developing an Android app costs $30,000 or thereabouts, does the average photography app recoup its development costs?

Hint: I've used 'min ascending' to sort the categories.

In [73]:
df_paid_apps = df_apps_clean.loc[df_apps_clean.Type == 'Paid']
In [74]:
box = px.box(df_paid_apps, x='Category', y='Revenue_Estimate', title='How Much Can Paid Apps Earn?', log_y=True, height=500)
box.update_layout(yaxis_title='Paid App Ballpark Revenue')
box.update_xaxes(categoryorder='min ascending')  # special ordering of the boxes -> look it up for all the category orders
box.show()

Looking at the hover text, how much does the median app earn in the Tools category? If developing an Android app costs \$30,000 or thereabouts, does the average photography app recoup its development costs?

If an Android app costs \$30,000 to develop, then the average app in very few categories would cover that development cost. The median paid photography app earned about \\$20,000. Many more app’s revenues were even lower - meaning they would need other sources of revenue like advertising or in-app purchases to make up for their development costs. However, certain app categories seem to contain a large number of outliers that have much higher (estimated) revenue - for example in Medical, Personalisation, Tools, Game, and Family.

So, if you were to list a paid app, how should you price it? To help you decide we can look at how your competitors in the same category price their apps.

How Much Can You Charge? Examine Paid App Pricing Strategies by Category¶

Challenge: What is the median price price for a paid app? Then compare pricing by category by creating another box plot. But this time examine the prices (instead of the revenue estimates) of the paid apps. I recommend using {categoryorder':'max descending'} to sort the categories.

In [75]:
print(f"The median price for a paid add is ${df_paid_apps.Price.median()}")
The median price for a paid add is $2.99
In [76]:
box = px.box(df_paid_apps, x='Category', y='Price', title='Price per Category', log_y=True, height=500)
box.update_layout(yaxis_title='Paid App Price')
box.update_xaxes(categoryorder='max descending')  # special ordering of the boxes -> look it up for all the category orders
box.show()

Some categories have higher median prices than others. This time we see that Medical apps have the most expensive apps as well as a median price of \$5.49. In contrast, Personalisation apps are quite cheap on average at \\$1.49. Other categories which higher median prices are Business (\$4.99) and Dating (\\$6.99). It seems like customers who shop in these categories are not so concerned about paying a bit extra for their apps.

Learning Points & Summary¶

In this lesson we looked at how to:

  • How to use .info() or .dtypes to get the types of data we're dealing with

  • Pull a random sample from a DataFrame using .sample()

  • How to find duplicate entries with .duplicated() and .drop_duplicates()

  • How to convert data types with Series.astype(str), Series.astype(int) and pd.to_numeric(Series)

  • How to use plotly to generate beautiful pie, donut, and bar charts as well as box and scatter plots: .pie(), .bar(), .box(), .scatter(), .update_layout(), .update_traces() for pie charts mainly, .update_xaxes()

  • How to use the .display() and .HTML() functions (to import first)

  • How to use build formatters and use it to display integers in a certain way, for instance with thousand separators

  • How to render or transform a Series with a special format df_apps_clean.Installs.map('{:,}'.format), knowing it has become a String Series though

  • How to use the string methods .str.split() and .str.replace() with Series, and define regex= True or False

  • How to use expand=True with the .split() method to create separate columns instead of

  • How to use nunique() to find the number of unique values in a Series

  • How to use agg() with more than one argument, e.g. df_apps_clean.groupby('Category').agg({'App': 'count', 'Installs': 'sum'})

  • How to .groupby() more than one category at the same time, e.g. df_apps_clean.groupby(["Category", "Type"]).agg({'App': 'count'}) and how to use as_index=False to have the grouped category as columns and not as index

  • How to use the .stack() method to stack from columns to index.

  • How to use DataFrame.index.nlevels to know the number of levels of an index, and to use DataFrame.reset_index()

  • How to use a double backslash \\ for dollar signs in markdowns.